When we talk about Data Science and the Data Science Pipeline, we are typically talking about the management of data flows for a specific purpose - the modeling of some hypothesis. The models that we construct can then be used in Data Products as an engine to create more data and actionable results. Machine learning is the art of using existing data along with a statistical method to train a parametric representation of a model that fits the data. That’s kind of a mouthful, but what it essentially means is that a machine learning algorithm uses statistical processes to learn from examples, then applies what it has learned to future inputs to predict an outcome.
Machine learning can classically be summarized with two methodologies: supervised and unsupervised learning. In supervised learning, the “correct answers” are annotated ahead of time and the algorithm tries to fit a decision space based on those answers. In unsupervised learning, algorithms try to group like examples together, inferring similarities via distance metrics. Machine learning allows us to handle new data in a meaningful way, predicting where new data will fit into our models.
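To make the distinction concrete, here is a minimal sketch (not part of the original examples, and reloading the Iris data so the cell stands alone): a supervised classifier is fit on features together with their annotated labels, while an unsupervised clusterer sees only the features and infers groupings on its own.
In [ ]:
# A minimal sketch contrasting supervised and unsupervised learning.
# The Iris data is reloaded here so this cell is self-contained.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
iris = load_iris()
# Supervised: the "correct answers" (iris.target) are given to the algorithm
clf = KNeighborsClassifier()
clf.fit(iris.data, iris.target)
print clf.predict(iris.data[:5])    # predicted class labels
# Unsupervised: only the features are given; the algorithm groups like examples
clusterer = KMeans(n_clusters=3)
clusterer.fit(iris.data)
print clusterer.labels_[:5]         # inferred cluster assignments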
Scikit-Learn is a powerful machine learning library implemented in Python on top of the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib, enabling extremely fast analysis of small to medium sized data sets. It is open source, commercially usable, and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a Data Scientist's toolkit for machine learning on incoming data sets.
The purpose of this notebook is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms, rather than simply as a research or investigation methodology. For more on Scikit-Learn see: Six Reasons why I recommend Scikit-Learn (O’Reilly Radar).
In [5]:
%matplotlib inline
# Things we'll need later
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
from sklearn import cross_validation as cv
# Load the example datasets
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_linnerud
# Boston house prices dataset (reals, regression)
boston = load_boston()
print "Boston: %i samples %i features" % boston.data.shape
# Iris flower dataset (reals, multi-label classification)
iris = load_iris()
print "Iris: %i samples %i features" % iris.data.shape
# Diabetes dataset (reals, regression)
diabetes = load_diabetes()
print "Diabetes: %i samples %i features" % diabetes.data.shape
# Hand-written digit dataset (multi-label classification)
digits = load_digits()
print "Digits: %i samples %i features" % digits.data.shape
# Linnerud psychological and exercise dataset (multivariate regression)
linnerud = load_linnerud()
print "Linnerud: %i samples %i features" % linnerud.data.shape
The datasets that come with Scikit-Learn demonstrate the properties of classification and regression algorithms, as well as how the data should fit. They are also small, which makes it easy to train models that work; as such, they are ideal for pedagogical purposes. The datasets module also contains functions for loading data from the mldata.org repository as well as for generating random data.
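As a quick sketch of those generator functions (make_classification and make_regression are standard Scikit-Learn generators; the sizes below are arbitrary), synthetic data with a known structure can be produced on demand:
In [ ]:
# Generate small synthetic datasets with a known structure
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
X, y = make_classification(n_samples=100, n_features=8, n_informative=4, n_classes=3, random_state=42)
print "Synthetic classification: %i samples %i features" % X.shape
X, y = make_regression(n_samples=100, n_features=8, noise=0.5, random_state=42)
print "Synthetic regression: %i samples %i features" % X.shape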
In [6]:
import pandas as pd
from pandas.tools.plotting import scatter_matrix
df = pd.DataFrame(iris.data)
df.columns = iris.feature_names
fig = scatter_matrix(df, alpha=0.2, figsize=(16, 10), diagonal='kde')
In [7]:
df = pd.DataFrame(diabetes.data)
fig = scatter_matrix(df, alpha=0.2, figsize=(16, 10), diagonal='kde')
In [8]:
import random
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
In [9]:
from sklearn.linear_model import LinearRegression
# Fit regression to diabetes dataset
model = LinearRegression()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [10]:
import skflow
model = skflow.TensorFlowLinearRegressor(steps=10000)
model.fit(diabetes.data, diabetes.target, logdir='/tmp/skflow/linear-regression/')
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
import tensorflow as tf
import skflow
options = [[1], [10], [20], [25], [30], [40]]
for hidden_units in options:
    print "hidden layers = ", str(hidden_units)
    def tanh_dnn(X, y):
        features = skflow.ops.dnn(X, hidden_units=hidden_units,
                                  activation=skflow.tf.tanh)
        return skflow.models.linear_regression(features, y)
    model = skflow.TensorFlowEstimator(model_fn=tanh_dnn, n_classes=0,
                                       steps=1000, learning_rate=0.1, batch_size=100, verbose=2)
    model.fit(diabetes.data, diabetes.target)
    expected = diabetes.target
    predicted = model.predict(diabetes.data)
    # Evaluate fit of the model
    print "Mean Squared Error: %0.3f" % mse(expected, predicted)
    print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.linear_model import Perceptron
model = Perceptron()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
import tensorflow as tf
import skflow
options = [[1,1], [10, 10], [15, 15], [20,20], [25,25]]
for hidden_units in options:
    print "hidden layers = ", str(hidden_units)
    def tanh_dnn(X, y):
        features = skflow.ops.dnn(X, hidden_units=hidden_units,
                                  activation=skflow.tf.tanh)
        return skflow.models.linear_regression(features, y)
    model = skflow.TensorFlowEstimator(model_fn=tanh_dnn, n_classes=0,
                                       steps=5000, learning_rate=0.1, batch_size=100)
    model.fit(diabetes.data, diabetes.target)
    expected = diabetes.target
    predicted = model.predict(diabetes.data)
    # Evaluate fit of the model
    print "Mean Squared Error: %0.3f" % mse(expected, predicted)
    print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.ensemble import AdaBoostRegressor
model = AdaBoostRegressor()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.svm import SVR
model = SVR()
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
Regularization methods decrease the over-fitting of a model by penalizing complexity. These are usually demonstrated on regression algorithms, which is why they are included in this section.
Ridge regression, also known as Tikhonov regularization, penalizes a least squares regression model on the square of the magnitude of the coefficients (the L2 norm).
In [ ]:
from sklearn.linear_model import Ridge
model = Ridge(alpha=0.1)
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
In [ ]:
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(diabetes.data, diabetes.target)
expected = diabetes.target
predicted = model.predict(diabetes.data)
# Evaluate fit of the model
print "Mean Squared Error: %0.3f" % mse(expected, predicted)
print "Coefficient of Determination: %0.3f" % r2_score(expected, predicted)
Classification is a supervised machine learning problem where, given labeled input data (with two or more labels), the task is to fit a function that can predict the discrete class of new input data.
Logistic regression fits a logistic model to data and makes predictions about the probability of a categorical event (between 0 and 1). Because its predictions lie between 0 and 1, a one-vs-all scheme is used to classify more than two classes (one model per class, winner-takes-all).
In [ ]:
from sklearn.linear_model import LogisticRegression
splits = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = LogisticRegression()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
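Because the discussion above framed logistic regression in terms of probabilities and a one-vs-all scheme, a brief sketch of predict_proba (reusing the fitted model and X_test from the previous cell) shows the per-class probabilities behind each prediction:
In [ ]:
# Per-class probabilities behind the one-vs-all decision
# (reuses the fitted model and X_test from the cell above)
probabilities = model.predict_proba(X_test[:5])
labels = model.predict(X_test[:5])
for probs, label in zip(probabilities, labels):
    print np.round(probs, 3), "->", label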
In [ ]:
import skflow
splits = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = skflow.TensorFlowLinearClassifier(n_classes=3,
steps=5000, learning_rate=0.1, batch_size=100)
model.fit(X_train, y_train, logdir='/tmp/skflow/logistic-regression/')
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
model_path = '/tmp/skflow_models/logistic-regression'
model.save(model_path)
restored_model = skflow.TensorFlowEstimator.restore(model_path)
print predicted == restored_model.predict(X_test)
In [ ]:
from sklearn.lda import LDA
splits = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = LDA()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
In [ ]:
from sklearn.naive_bayes import GaussianNB
splits = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = GaussianNB()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
splits = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = KNeighborsClassifier()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
In [ ]:
from sklearn.tree import DecisionTreeClassifier
splits = cv.train_test_split(iris.data, iris.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
In [ ]:
from sklearn.svm import SVC
kernels = ['linear', 'poly', 'rbf']
splits = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
for kernel in kernels:
    if kernel != 'poly':
        model = SVC(kernel=kernel)
    else:
        model = SVC(kernel=kernel, degree=3)
    model.fit(X_train, y_train)
    expected = y_test
    predicted = model.predict(X_test)
    print classification_report(expected, predicted)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
splits = cv.train_test_split(digits.data, digits.target, test_size=0.2)
X_train, X_test, y_train, y_test = splits
model = RandomForestClassifier()
model.fit(X_train, y_train)
expected = y_test
predicted = model.predict(X_test)
print classification_report(expected, predicted)
Clustering algorithms attempt to find patterns in unlabeled data. They are usually grouped into two main categories: centroidal (find the centers of clusters) and hierarchical (find clusters of clusters).
In order to explore clustering, we'll have to generate some fake datasets to use.
In [ ]:
from sklearn.datasets import make_circles
from sklearn.datasets import make_moons
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
N = 1000 # Number of samples in each cluster
# Some colors for later
colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)
circles = make_circles(n_samples=N, factor=.5, noise=.05)
moons = make_moons(n_samples=N, noise=.08)
blobs = make_blobs(n_samples=N, random_state=9)
noise = np.random.rand(N, 2), None
# Let's see what the data looks like!
fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:,0], X[:,1], marker='.')
    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')
plt.show()
In [ ]:
from sklearn.cluster import MiniBatchKMeans
fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    # Fit the model with our algorithm
    model = MiniBatchKMeans(n_clusters=2)
    model.fit(X)
    # Make Predictions
    predictions = model.predict(X)
    # Select the subplot for this dataset before drawing on it
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:, 0], X[:, 1], color=colors[predictions].tolist(), s=10)
    # Find and plot the cluster centers
    centers = model.cluster_centers_
    center_colors = colors[:len(centers)]
    plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')
plt.show()
Affinity Propagation (AP) is clustering based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm. Like k-medoids, AP finds "exemplars", members of the input set that are representative of clusters.
In [ ]:
from sklearn.cluster import AffinityPropagation
fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    # Fit the model with our algorithm
    model = AffinityPropagation(damping=.9, preference=-200)
    model.fit(X)
    # Make Predictions
    predictions = model.predict(X)
    # Select the subplot for this dataset before drawing on it
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:, 0], X[:, 1], color=colors[predictions].tolist(), s=10)
    # Find and plot the exemplars (cluster centers)
    centers = model.cluster_centers_
    center_colors = colors[:len(centers)]
    plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')
plt.show()
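The clustering overview above also mentioned hierarchical methods, which build clusters of clusters. As a brief sketch (reusing the synthetic datasets and plotting setup from the cells above, and assuming a Scikit-Learn version that provides AgglomerativeClustering), a bottom-up hierarchical clusterer can be run the same way; it has no separate predict step, so fit_predict labels the data directly:
In [ ]:
from sklearn.cluster import AgglomerativeClustering
fig, axe = plt.subplots(figsize=(18, 4))
for idx, dataset in enumerate((circles, moons, blobs, noise)):
    X, y = dataset
    X = StandardScaler().fit_transform(X)
    # Fit the model with our algorithm; hierarchical clusterers label
    # the training data directly via fit_predict
    model = AgglomerativeClustering(n_clusters=2)
    predictions = model.fit_predict(X)
    plt.subplot(1,4,idx+1)
    plt.scatter(X[:, 0], X[:, 1], color=colors[predictions].tolist(), s=10)
    plt.xticks(())
    plt.yticks(())
    plt.ylabel('$x_1$')
    plt.xlabel('$x_0$')
plt.show()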
In [ ]: